First I explore the basic structure of the red wine data set and get its summary.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
##  X                fixed.acidity   volatile.acidity   citric.acid
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200     Min.   :0.000
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900     1st Qu.:0.090
##  Median : 800.0   Median : 7.90   Median :0.5200     Median :0.260
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278     Mean   :0.271
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400     3rd Qu.:0.420
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800     Max.   :1.000
##  residual.sugar   chlorides         free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00
##  Median : 2.200   Median :0.07900   Median :14.00
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00
##  total.sulfur.dioxide   density          pH              sulphates
##  Min.   :  6.00         Min.   :0.9901   Min.   :2.740   Min.   :0.3300
##  1st Qu.: 22.00         1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500
##  Median : 38.00         Median :0.9968   Median :3.310   Median :0.6200
##  Mean   : 46.47         Mean   :0.9967   Mean   :3.311   Mean   :0.6581
##  3rd Qu.: 62.00         3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300
##  Max.   :289.00         Max.   :1.0037   Max.   :4.010   Max.   :2.0000
##  alcohol         quality
##  Min.   : 8.40   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.20   Median :6.000
##  Mean   :10.42   Mean   :5.636
##  3rd Qu.:11.10   3rd Qu.:6.000
##  Max.   :14.90   Max.   :8.000
There are 1599 red wine observations in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality).
75% of red wines have volatile acidity equal to or less than 0.64 g/dm^3. The minimum value of citric acid is 0.0 g/dm^3 and 25% of red wines have citric acid equal to or less than 0.09 g/dm^3. Residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, and sulphates are equal to or less than 2.6 g/dm^3, 0.09 g/dm^3, 21.0 mg/dm^3, 62.0 mg/dm^3, and 0.73 g/dm^3, respectively, for 75% of red wines. Quality of red wines only takes integer values, which means it can be treated as an ordered variable. The worst, median, and best quality for red wines are 3, 6 and 8, respectively.
Next I will look at the distributions of all the variables (fixed acidity, volatile acidity, …, and quality) except “X” by plotting histograms to get an overview of what the data set is like.
It is notable that the distributions of fixed acidity and volatile acidity are not symmetrical but both skewed right.
Here I used a small binwidth to better understand the distribution of citric acid. The distribution shows a very large peak at around zero, and I wonder whether these data are really correct. Using the “count” function, I found that citric acid is zero for 132 (8.3%) of the red wines. Most red wines have citric acid less than 0.75 g/dm^3, but there is one outlier at 1.0 g/dm^3. The calculations are shown below.
## Source: local data frame [2 x 2]
##
## citric.acid == 0 n
## 1 FALSE 1467
## 2 TRUE 132
## Source: local data frame [2 x 2]
##
## citric.acid == 1 n
## 1 FALSE 1598
## 2 TRUE 1
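The shares quoted above can be checked directly from the two count tables. The following is a minimal sketch in base R that rebuilds the logical vectors from the reported counts instead of loading the data; the variable names are illustrative and are not part of the original analysis.

```r
# Counts taken from the two tables above (1599 wines in total).
n_total <- 1599
citric_is_zero <- rep(c(TRUE, FALSE), times = c(132, 1467))
citric_is_one  <- rep(c(TRUE, FALSE), times = c(1, 1598))

# The mean of a logical vector is the share of TRUE values.
share_zero <- 100 * mean(citric_is_zero)
share_one  <- 100 * mean(citric_is_one)

print(table(citric_is_zero))
print(round(c(zero = share_zero, one = share_one), 2))
```

The first share rounds to the 8.3% quoted in the text, and the second confirms that the value at 1.0 g/dm^3 is a single wine.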
Resume plotting histograms…
The distribution of residual sugar is also skewed right. I transformed the long-tailed data to better understand it.
I used a small binwidth and changed the upper limit of chlorides to better understand the long-tailed distribution. Most red wines have chlorides less than 0.2 g/dm^3, but there are some outliers.
Free and total sulfur dioxide also show right-skewed distributions. I used a small binwidth to better understand the distribution of free sulfur dioxide. Free sulfur dioxide is less than 60 mg/dm^3 for most red wines, but there are some outliers. Total sulfur dioxide is less than 150 mg/dm^3 for most red wines, but again there are some outliers.
Density and pH are almost normally distributed, while sulphates and alcohol are again skewed right. Sulphates is less than 1.5 g/dm^3 for most red wines, but there are some outliers. I used a small binwidth to better understand the distribution of alcohol.
Quality can be regarded as an ordered variable ranging from 3 to 8. As shown in the calculations below, 1319 out of 1599 red wines (82%) have an intermediate quality of 5 or 6, while only 28 (1.8%) have a quality of 3 or 8. Taking all these findings into account, I wonder whether a linear model using some of the features can be a good method to predict red wine quality, or whether we need to consider other approaches.
## Source: local data frame [2 x 2]
##
## quality == 5 n
## 1 FALSE 918
## 2 TRUE 681
## Source: local data frame [2 x 2]
##
## quality == 6 n
## 1 FALSE 961
## 2 TRUE 638
## Source: local data frame [2 x 2]
##
## quality == 3 n
## 1 FALSE 1589
## 2 TRUE 10
## Source: local data frame [2 x 2]
##
## quality == 8 n
## 1 FALSE 1581
## 2 TRUE 18
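The percentages for the middle and extreme quality levels follow from the four count tables above. This short base R sketch redoes the arithmetic from the reported counts only; the counts for quality 4 and 7 are not listed in the tables, so only their combined total is derived here. Variable names are illustrative.

```r
n_total <- 1599

# Counts reported in the tables above (quality 4 and 7 are not listed there).
n_quality <- c("3" = 10, "5" = 681, "6" = 638, "8" = 18)

n_middle  <- sum(n_quality[c("5", "6")])   # wines with quality 5 or 6
n_extreme <- sum(n_quality[c("3", "8")])   # wines with quality 3 or 8
n_4_or_7  <- n_total - sum(n_quality)      # everything else

share <- 100 * c(middle = n_middle, extreme = n_extreme,
                 four_or_seven = n_4_or_7) / n_total
print(round(share, 1))
```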
In addition, I’m interested in the asymmetric distribution of red wine quality (there are more red wines with quality = 7 than with quality = 4). I wonder if this has anything to do with the right-skewed distributions of fixed acidity, volatile acidity, residual sugar, free sulfur dioxide, total sulfur dioxide, sulphates and alcohol.
There are 1599 red wine observations in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality). Quality is an ordered variable with min = 3 and max = 8.
Other observations:
The distributions of fixed acidity, volatile acidity, residual sugar, free sulfur dioxide, total sulfur dioxide, sulphates and alcohol are skewed right.
About 8.3% of red wines have zero citric acid.
There are some outliers in the data of citric acid, chlorides, free and total sulfur dioxide, as well as sulphates.
Most red wines have a quality of 5 or 6, and fewer than 2% have a quality of 3 or 8.
The distribution of quality is asymmetric where there are more red wines with quality = 7 than those with quality = 4.
The main features in the data set are fixed acidity, volatile acidity, residual sugar, free sulfur dioxide, total sulfur dioxide, sulphates, alcohol and quality. I’d like to know which variables contribute most to the quality of red wine. Based on the description of the Red Wine Quality dataset provided by Cortez et al., I suspect volatile acidity, citric acid, residual sugar, and total sulfur dioxide could be the main features for predicting quality.
I suspect the asymmetric distribution of quality has something to do with the right-skewed feature of variables like volatile acidity or alcohol.
No, I did not create any new variables from the existing ones.
About 8.3% of red wines have zero citric acid. Although it’s not clear whether those values are correct or due to measurement errors, I will consider removing them as outliers in later investigation to see if it helps predict the quality of red wine more clearly.
I’ll start the bivariate analysis by calculating correlation coefficients for each pair of variables.
## X fixed.acidity volatile.acidity
## X 1.000000000 -0.26848392 -0.008815099
## fixed.acidity -0.268483920 1.00000000 -0.256130895
## volatile.acidity -0.008815099 -0.25613089 1.000000000
## citric.acid -0.153551355 0.67170343 -0.552495685
## residual.sugar -0.031260835 0.11477672 0.001917882
## chlorides -0.119868519 0.09370519 0.061297772
## free.sulfur.dioxide 0.090479643 -0.15379419 -0.010503827
## total.sulfur.dioxide -0.117849669 -0.11318144 0.076470005
## density -0.368372087 0.66804729 0.022026232
## pH 0.136005328 -0.68297819 0.234937294
## sulphates -0.125306999 0.18300566 -0.260986685
## alcohol 0.245122841 -0.06166827 -0.202288027
## quality 0.066452608 0.12405165 -0.390557780
## citric.acid residual.sugar chlorides
## X -0.15355136 -0.031260835 -0.119868519
## fixed.acidity 0.67170343 0.114776724 0.093705186
## volatile.acidity -0.55249568 0.001917882 0.061297772
## citric.acid 1.00000000 0.143577162 0.203822914
## residual.sugar 0.14357716 1.000000000 0.055609535
## chlorides 0.20382291 0.055609535 1.000000000
## free.sulfur.dioxide -0.06097813 0.187048995 0.005562147
## total.sulfur.dioxide 0.03553302 0.203027882 0.047400468
## density 0.36494718 0.355283371 0.200632327
## pH -0.54190414 -0.085652422 -0.265026131
## sulphates 0.31277004 0.005527121 0.371260481
## alcohol 0.10990325 0.042075437 -0.221140545
## quality 0.22637251 0.013731637 -0.128906560
## free.sulfur.dioxide total.sulfur.dioxide density
## X 0.090479643 -0.11784967 -0.36837209
## fixed.acidity -0.153794193 -0.11318144 0.66804729
## volatile.acidity -0.010503827 0.07647000 0.02202623
## citric.acid -0.060978129 0.03553302 0.36494718
## residual.sugar 0.187048995 0.20302788 0.35528337
## chlorides 0.005562147 0.04740047 0.20063233
## free.sulfur.dioxide 1.000000000 0.66766645 -0.02194583
## total.sulfur.dioxide 0.667666450 1.00000000 0.07126948
## density -0.021945831 0.07126948 1.00000000
## pH 0.070377499 -0.06649456 -0.34169933
## sulphates 0.051657572 0.04294684 0.14850641
## alcohol -0.069408354 -0.20565394 -0.49617977
## quality -0.050656057 -0.18510029 -0.17491923
## pH sulphates alcohol quality
## X 0.13600533 -0.125306999 0.24512284 0.06645261
## fixed.acidity -0.68297819 0.183005664 -0.06166827 0.12405165
## volatile.acidity 0.23493729 -0.260986685 -0.20228803 -0.39055778
## citric.acid -0.54190414 0.312770044 0.10990325 0.22637251
## residual.sugar -0.08565242 0.005527121 0.04207544 0.01373164
## chlorides -0.26502613 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.07037750 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide -0.06649456 0.042946836 -0.20565394 -0.18510029
## density -0.34169933 0.148506412 -0.49617977 -0.17491923
## pH 1.00000000 -0.196647602 0.20563251 -0.05773139
## sulphates -0.19664760 1.000000000 0.09359475 0.25139708
## alcohol 0.20563251 0.093594750 1.00000000 0.47616632
## quality -0.05773139 0.251397079 0.47616632 1.00000000
Although we need to investigate 2D scatter plots to look at the correlation between two variables in detail, it is notable that the following pairs have relatively large correlation coefficients:
quality and volatile acidity (negative)
quality and citric acid, sulphates and alcohol (positive)
density and alcohol (negative)
density and residual sugar (positive)
pH and citric acid (negative)
free sulfur dioxide and total sulfur dioxide (positive)
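Scanning a large correlation matrix by eye is error-prone, so the strongly correlated pairs can also be picked out programmatically. The sketch below does this in base R for a five-variable subset of the matrix above, with the coefficients typed in (rounded to four decimals) rather than computed from the data. The cutoff of 0.35 is an assumption chosen for illustration; it is not a threshold used in the original analysis.

```r
vars <- c("volatile.acidity", "citric.acid", "density", "alcohol", "quality")

# Start from the identity matrix, then fill the upper triangle column by column
# with coefficients copied from the correlation output above.
r <- diag(length(vars))
dimnames(r) <- list(vars, vars)
r[upper.tri(r)] <- c(
  -0.5525,                           # volatile.acidity ~ citric.acid
   0.0220,  0.3649,                  # density column
  -0.2023,  0.1099, -0.4962,         # alcohol column
  -0.3906,  0.2264, -0.1749, 0.4762  # quality column
)
r[lower.tri(r)] <- t(r)[lower.tri(r)]  # mirror to make the matrix symmetric

# Keep each pair once (upper triangle) if |r| reaches the assumed cutoff.
cutoff <- 0.35
idx <- which(abs(r) >= cutoff & upper.tri(r), arr.ind = TRUE)
strong <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)
strong <- strong[order(-abs(strong$r)), ]
print(strong, row.names = FALSE)
```

On the full data set the same filter could be applied to the output of `cor()`, after dropping the index column “X”.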
The pair plot below provides an overview of the relationships between the variables in the data set. I omitted the variable “X” and added “smooth” to the plots.
I want to look closer at the plots involving quality and other variables: fixed acidity, volatile acidity, residual sugar, free sulfur dioxide, total sulfur dioxide, sulphates, alcohol, and so on.
The first one is quality vs. fixed acidity.
Since I found a scatter plot doesn’t help much to see the relationship between quality and fixed acidity, I created a box plot instead. I’ll be using the same methods for other variables in later investigations. Here we see a weak trend that red wines with better quality have a larger median fixed acidity, but it is not clear.
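Because quality takes only a few integer values, grouping by quality and comparing medians is what the box plots summarize. The sketch below illustrates the idea in base R on the first ten rows only, copied from the `str()` output at the top of this report; it is far too small a sample to draw conclusions from and is meant only to show the mechanics. The data frame name `first_rows` is hypothetical.

```r
# First ten rows of the data set, copied from the str() output above.
first_rows <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  quality       = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Median fixed acidity within each quality level: the centre line of each box.
medians <- tapply(first_rows$fixed.acidity, first_rows$quality, median)
print(medians)

# The corresponding box plot; treating quality as a factor gives one box per level.
# boxplot(fixed.acidity ~ factor(quality), data = first_rows)
```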
Resume boxplots…
Red wines with better quality have a smaller median volatile acidity. When volatile acidity is larger than 0.8 g/dm^3, the quality of red wines can hardly be 7 or better.
In the second plot above I removed data with citric acid = 0.0. We can see a trend that red wines with better quality have a larger median citric acid.
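Dropping the zero-citric-acid rows before grouping is a one-line filter. A minimal base R sketch, again using only the first ten rows shown in the `str()` output as stand-in data (the name `first_rows` is hypothetical, and the real analysis would filter the full data frame instead):

```r
# First ten rows of the data set, copied from the str() output above.
first_rows <- data.frame(
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality     = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Keep only wines with a non-zero citric acid measurement.
nonzero <- subset(first_rows, citric.acid > 0)
print(nrow(nonzero))

# Median citric acid per quality level after the filter.
print(tapply(nonzero$citric.acid, nonzero$quality, median))
```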
In the second plot above I limited the range of residual sugar to (0, 8) in order to better understand the change in the median values. There is no clear relationship between residual sugar and quality.
I modified the range of chlorides in the second plot above. We see a weak trend that red wines with better quality have a smaller median chlorides, but it is not clear.
There is a nonlinear relationship between quality and the median free sulfur dioxide.
Again, there is a nonlinear trend between quality and the median total sulfur dioxide. I’ll look at this in more detail afterward.
We see a trend where red wines with better quality have a smaller median density.
We see a trend where red wines with better quality have a smaller median pH.
Red wines with better quality have a larger median sulphates.
Red wines with better quality have a larger median alcohol. If alcohol is below 10 % by volume, the quality of red wines is mostly 6 or worse.
So far I have found that the quality of red wines has a relatively clear relationship with volatile acidity, citric acid, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. Here I also want to look at the relationships between selected pairs of variables, to see whether any of them carry essentially the same information and should therefore not be used together.
The following seven plots are for the pairs including volatile acidity.
It is clear that volatile acidity negatively correlates with citric acid.
Next, the pairs including citric acid.
Citric acid negatively correlates with pH.
Next, the pairs including free sulfur dioxide.
Free sulfur dioxide positively correlates with total sulfur dioxide.
Next, the pairs including total sulfur dioxide.
Unable to find any clear correlations.
Next, the pairs including density.
Density and alcohol are negatively correlated.
Next, the pairs including pH.
Neither of the two shows a clear correlation.
The last one is sulphates vs. alcohol, which does not show a clear relationship either.
Red wines with better quality tend to have a smaller volatile acidity, larger citric acid, smaller density, smaller pH, larger sulphates, and larger alcohol.
If volatile acidity is larger than 0.8 g/dm^3, the quality of red wines can hardly be 7 or better.
If alcohol is below 10 % by volume, the quality of red wines is mostly 6 or worse.
Quality of red wines has a nonlinear relationship with free sulfur dioxide and total sulfur dioxide.
There are also weak trends where red wines with better quality tend to have a larger fixed acidity and a smaller chlorides.
Volatile acidity negatively correlates with citric acid.
Citric acid negatively correlates with pH.
Free sulfur dioxide positively correlates with total sulfur dioxide.
Density and alcohol are negatively correlated.
The quality of red wine has the strongest positive correlation with alcohol (correlation coefficient is 0.48). The negative correlation between quality and volatile acidity is also strong (correlation coefficient is -0.39).
Here I will look at how the quality of red wines is distributed on scatter plots defined by two of the features. I use volatile acidity, citric acid, sulphates, and total sulfur dioxide, in addition to alcohol, as the main features for this investigation, since the other features either correlate with at least one of these five or have been found to have little to do with quality (like fixed acidity or chlorides).
I included citric acid in the main features even though it correlates with volatile acidity. Citric acid also correlates with pH, whereas the correlation between volatile acidity and pH was not clear, which made me doubt that citric acid is largely dependent on volatile acidity.
We see that red wines with better quality are distributed in the region with small volatile acidity and large alcohol, while worse ones are in the region with large volatile acidity and small alcohol. This suggests that we could build a model that assigns red wines to quality classes using some classification technique.
In the second plot I removed wines with zero citric acid as outliers. We see that better wines have larger alcohol and citric acid.
Red wines with better quality have larger alcohol and sulphates.
This plot is a little complicated. In general, better wines have larger alcohol. At the same time, many of the extreme wines (the best and the worst) have small total sulfur dioxide. This is consistent with the nonlinear relationship between quality and total sulfur dioxide discussed in the Bivariate Plots Section. I won’t go further into this point since it’s not a clear trend.
Taking all the plots above into account, I think alcohol and volatile acidity are the best features for predicting the quality of red wines. We could apply some classification method to the alcohol vs. volatile acidity scatter plot to categorize red wines into different quality classes.
Here I also try a linear model using alcohol, volatile acidity, citric acid, and sulphates to predict the quality of red wines.
# Examine linear models to predict quality.
# mtable() comes from the memisc package.
library(memisc)
m1 <- lm(formula = quality ~ alcohol, data = rdw)
m2 <- lm(formula = quality ~ alcohol + volatile.acidity, data = rdw)
m3 <- lm(formula = quality ~ alcohol + volatile.acidity + citric.acid, data = rdw)
m4 <- lm(formula = quality ~ alcohol + volatile.acidity + citric.acid +
           sulphates, data = rdw)
# Compare the four nested models side by side.
mtable(m1, m2, m3, m4)
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = rdw)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = rdw)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + citric.acid,
## data = rdw)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + citric.acid +
## sulphates, data = rdw)
##
## =================================================================
##                      m1          m2          m3          m4
## -----------------------------------------------------------------
## (Intercept)          1.875***    3.095***    3.055***    2.646***
##                     (0.175)     (0.184)     (0.194)     (0.201)
## alcohol              0.361***    0.314***    0.314***    0.309***
##                     (0.017)     (0.016)     (0.016)     (0.016)
## volatile.acidity                -1.384***   -1.343***   -1.265***
##                                 (0.095)     (0.114)     (0.113)
## citric.acid                                  0.068      -0.079
##                                             (0.103)     (0.104)
## sulphates                                                0.696***
##                                                         (0.103)
## -----------------------------------------------------------------
## R-squared            0.227       0.317       0.317       0.336
## adj. R-squared       0.226       0.316       0.316       0.334
## sigma                0.710       0.668       0.668       0.659
## F                    468.267     370.379     246.976     201.777
## p                    0.000       0.000       0.000       0.000
## Log-likelihood      -1721.057   -1621.814   -1621.596   -1599.093
## Deviance             805.870     711.796     711.603     691.852
## AIC                  3448.114    3251.628    3253.192    3210.186
## BIC                  3464.245    3273.136    3280.078    3242.448
## N                    1599        1599        1599        1599
## =================================================================
The variables in this linear model account for 33.6% of the variance in the quality of red wines.
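As a consistency check on the table, the R-squared of the single-predictor model m1 should equal the squared correlation between alcohol and quality, and the adjusted value follows from the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1). A small sketch using only numbers reported above (variable names are illustrative):

```r
r_alcohol_quality <- 0.47616632  # from the correlation matrix above
n <- 1599                        # number of wines
p <- 1                           # predictors in m1 (alcohol only)

# For a simple linear regression, R-squared is the squared correlation.
r_squared <- r_alcohol_quality^2

# Adjusted R-squared penalises the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 3))
```

Both values agree with the m1 column of the table (0.227 and 0.226).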
The scatter plot of alcohol vs. volatile acidity, colored by quality, suggests that we could build a good classification model that categorizes red wines by quality. I wasn’t able to create an actual model, but one simple example is as follows:
if volatile acidity >= 0.8 g/dm^3: quality is 3 or 4
else if alcohol <= 10 % by volume: quality is 5
else if alcohol >= 12 % by volume or volatile acidity <= 0.4 g/dm^3: quality is 7 or 8
else: quality is 5 or 6
Similar models could be built also for the alcohol vs. citric acid plot or the alcohol vs. sulphates plot.
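The hand-written rule above translates directly into a small R function. This is only a sketch of that rule: the function name `classify_wine` and the label strings are made up for illustration, the thresholds are the ones stated in the text, and nothing here has been fitted to or validated against the data.

```r
# Rule-based quality guess from volatile acidity (g/dm^3) and alcohol (% by volume).
# Hypothetical helper: thresholds are copied from the rule in the text.
classify_wine <- function(volatile_acidity, alcohol) {
  if (volatile_acidity >= 0.8) {
    "3 or 4"
  } else if (alcohol <= 10) {
    "5"
  } else if (alcohol >= 12 || volatile_acidity <= 0.4) {
    "7 or 8"
  } else {
    "5 or 6"
  }
}

# Apply the rule to the first three wines shown in the str() output above.
volatile_acidity <- c(0.70, 0.88, 0.76)
alcohol          <- c(9.4, 9.8, 9.8)
predicted <- mapply(classify_wine, volatile_acidity, alcohol)
print(predicted)
```

Note that the order of the branches matters: a wine with high volatile acidity is assigned to the lowest class even if its alcohol content is high.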
If alcohol is equal to or less than 10 % by volume, the quality of red wines is mostly 6 or worse, regardless of the other features. This is consistent with what I observed from the boxplot in the Bivariate Plots Section.
Yes. I created a linear model that predicts quality from alcohol, volatile acidity, citric acid, and sulphates.
The variables in the linear model account for 33.6% of the variance in the red wine quality. The addition of the citric acid to the model did not improve the R^2 value perhaps due to the variable’s (negative) correlation with the volatile acidity.
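The AIC values in the model table point the same way as the unchanged R^2. As a rough check, AIC can be recomputed from the reported log-likelihoods as -2 logLik + 2k, where k counts the regression coefficients plus one for the residual variance (the convention R uses for `lm` fits). The sketch below uses only numbers from the table; variable names are illustrative.

```r
# Reported log-likelihoods for the models without (m2) and with (m3) citric acid.
log_lik <- c(m2 = -1621.814, m3 = -1621.596)

# Coefficients including the intercept, plus one for the residual variance.
n_coef <- c(m2 = 3, m3 = 4)
k <- n_coef + 1

aic <- -2 * log_lik + 2 * k
print(aic)

# A positive difference means the extra term cost more than it gained.
print(aic[["m3"]] - aic[["m2"]])
```

The recomputed values match the table (3251.628 and 3253.192), and the slightly higher AIC for m3 agrees with the observation that adding citric acid did not improve the fit.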
Quality can be regarded as an ordered variable ranging from 3 to 8. The asymmetric distribution of red wine quality (there are more red wines with quality = 7 than with quality = 4) is perhaps due to the right-skewed distributions of the main features that contribute most to quality, like volatile acidity or alcohol.
Red wines with better quality have a larger median alcohol. If alcohol is below 10 % by volume, the quality of red wines is mostly 6 or worse.
We see that red wines with better quality are distributed in the region with small volatile acidity and large alcohol, while worse ones are in the region with large volatile acidity and small alcohol. This suggests that we could build a model that assigns red wines to quality classes using some classification technique.
The data set I used contains 1599 red wines with 11 variables on the chemical properties of the wine. I explored the quality of red wines across the different variables and found that alcohol and volatile acidity, as well as citric acid and sulphates, are the main features that contribute to quality. The other features either correlated with at least one of the main features or did not have much impact on quality. I struggled to build a model that actually predicts the quality of red wines from these features because the output, i.e. quality, is not a continuous variable but a discrete ordered one. The linear model I tried explained only 34% of the variance in red wine quality. I then proposed applying classification techniques to scatter plots such as alcohol vs. volatile acidity to categorize red wines into different quality ranges.
One suggestion for future investigation: so far even a good classification method won’t be able to clearly predict the quality of red wines in the very good range (quality = 7 or 8) or in the very poor range (quality = 3 or 4). This is partly due to a lack of red wine data in those quality ranges, so I would consider increasing the number of very good and very poor red wine samples. Furthermore, it is critical but quite hard to select the right features to predict the quality of red wines, since we don’t know exactly how people sense and feel the taste of food and drinks. I’m interested in how state-of-the-art machine learning techniques such as deep neural networks, incorporating many more variables related to red wines, could help improve the accuracy of quality prediction.